Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams
Abstract
Computer-supported authorship attribution provides tools for extracting stylistic features that can help verify or identify the author of text documents. Finding the author of a document is important in many situations, such as detecting plagiarism to protect copyrights and supporting forensic work during criminal investigations. This paper therefore explores a novel stylistic feature with the aim of accurately characterizing an author’s work. In particular, part-of-speech skip-grams and an in-house top-k sequential pattern mining algorithm are considered for the task of authorship attribution. A study using a collection of 30 texts written by 10 authors, consisting of 2,615,856 words and 99,903 sentences, confirms that mining part-of-speech skip-grams in texts facilitates authorship inference.
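To make the idea of frequent part-of-speech skip-grams concrete, the following is a minimal sketch and not the paper's in-house algorithm. It assumes sentences are already tagged (the Penn Treebank style tags are illustrative), that a skip-gram may skip at most `max_gap` tags between consecutive elements, and that a pattern's support is the number of sentences containing it. The names `pos_skipgrams` and `top_k_skipgrams` are hypothetical, and the brute-force enumeration stands in for a real top-k sequential pattern miner.

```python
from collections import Counter
from itertools import combinations


def pos_skipgrams(tags, n, max_gap):
    """Return the set of length-n POS skip-grams found in one tagged sentence.

    Assumption: at most `max_gap` tags may be skipped between two
    consecutive elements of a pattern.
    """
    found = set()
    # Brute force over index tuples; fine for a sketch, slow on long sentences.
    for idx in combinations(range(len(tags)), n):
        if all(b - a - 1 <= max_gap for a, b in zip(idx, idx[1:])):
            found.add(tuple(tags[i] for i in idx))
    return found


def top_k_skipgrams(sentences, n, max_gap, k):
    """Return the k skip-grams with the highest sentence-level support."""
    support = Counter()
    for tags in sentences:
        # Each sentence contributes at most 1 to a pattern's support.
        support.update(pos_skipgrams(tags, n, max_gap))
    return support.most_common(k)


# Toy input: three sentences given as POS tag sequences (illustrative only).
sentences = [
    ["DT", "NN", "VBZ", "JJ"],
    ["DT", "JJ", "NN", "VBZ"],
    ["PRP", "VBZ", "DT", "NN"],
]
profile = top_k_skipgrams(sentences, n=2, max_gap=1, k=3)
```

Under these assumptions, an author's profile would be the small set of patterns returned by `top_k_skipgrams` over that author's texts, which could then be compared against the patterns found in a disputed document.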
Similar resources
CNG Method with Weighted Voting
CNG Method for Authorship Attribution. The Common N-Grams (CNG) classification method for authorship attribution (AATT) was described in [2]. The method is based on extracting the most frequent byte n-grams of size n from the training data. The n-grams are sorted by their normalized frequency, and the first L most-frequent n-grams define an author profile. Given a test document, the test profil...
Enhancing Authorship Attribution By Utilizing Syntax Tree Profiles
The aim of modern authorship attribution approaches is to analyze known authors and to assign authorships to previously unseen and unlabeled text documents based on various features. In this paper we present a novel feature to enhance current attribution methods by analyzing the grammar of authors. To extract the feature, a syntax tree of each sentence of a document is calculated, which is then...
Local Histograms of Character N-grams for Authorship Attribution
This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for ...
Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches
This paper reports the first authorship attribution results based on the automatic computational methods for the Lithuanian language. Using supervised machine learning techniques we experimentally investigated the influence of different feature types (lexical, character, and syntactic) focusing on a few authors within three datasets, containing transcripts of the parliamentary speeches and deba...
Local n-grams for Author Identification Notebook for PAN at CLEF 2013
Our approach to the author identification task uses existing authorship attribution methods using local n-grams (LNG) and performs a weighted ensemble. This approach came in third for this year’s competition, using a relatively simple scheme of weights by training set accuracy. LNG models create profiles, consisting of a list of character n-grams that best represent a particular author’s writin...